INTRODUCTION

Machine Learning: Classification - Managing the Quality Metric of Global Ecological Footprint

Logistic Regression is a Machine Learning algorithm used to predict the value of a categorical dependent variable, such as the condition of a tumor (malignant or benign), the classification of an email (spam or not spam), or admission into a university (admitted or not admitted), by learning from independent variables (various features relevant to the problem).

For example, for classifying an email, the algorithm will use the words in the email as features and based on that make a prediction whether the email is spam or not.

Logistic Regression is a supervised Machine Learning algorithm, which means the data provided for training is labeled, i.e., the answers are already provided in the training set. The algorithm learns from those examples and their corresponding answers (labels) and then uses that to classify new examples.

In mathematical terms, suppose the dependent variable is Y and the set of independent variables is X; logistic regression then predicts the probability P(Y=1) as a function of X, the set of independent variables.
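The mapping from X to P(Y=1) is done through the sigmoid (logistic) function applied to a linear combination of the features. A minimal sketch, with illustrative weight and intercept values:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score z = w.x + b is squashed into a probability P(Y=1 | x).
w = np.array([0.5, -0.25])   # example weights (illustrative, not fitted)
b = 0.1                      # example intercept
x = np.array([2.0, 1.0])     # example feature vector

p = sigmoid(w @ x + b)
print(round(p, 4))
```

Training logistic regression amounts to finding the weights w and intercept b that make these predicted probabilities match the labels in the training set.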

Logistic Regression can be divided into types based on the kind of classification it performs. With that in view, there are three types of Logistic Regression:

  1. Binary Logistic Regression
  2. Multinomial Logistic Regression
  3. Ordinal Logistic Regression

The major difference between Logistic and Linear Regression is that Linear Regression is used to solve regression problems, whereas Logistic Regression is used for classification problems. In regression problems, the target variable can take continuous values, such as the price of a product or the age of a participant. Classification problems, in contrast, deal with predicting a target variable that can only take discrete values, for example, the gender of a person or whether a tumor is malignant or benign.

To perform the logistic regression we shall do the following:

a. Import the libraries that will be needed. Should the libraries not be installed, we shall install them with pip.

b. Perform data importation and preprocessing

c. Perform exploratory visualization

d. Perform model building and training

e. Perform model evaluation

f. Draw conclusions

A. LIBRARY IMPORTATION
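The imports below are a sketch of what the steps that follow rely on; the exact set used in the original notebook may differ. If any library is missing, it can be installed with pip (e.g. `pip install numpy pandas matplotlib scikit-learn`).

```python
# Core data-handling and numerical libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Modelling utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```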

B. DATA IMPORTATION AND PREPROCESSING

The data has been read in as 'df'. The data set has a dimension of 10000 rows and 14 features recorded as columns. A statistical description of the data set shows that 'stabf' is a factor variable with two levels: Stable and Unstable. The records are unique in their value recordings, with thirteen float variables and one object variable. The data set is 100% complete, with no missing, null, or duplicate values. The target variable 'stabf' has 6380 unstable records and 3620 stable records. The 'stab' feature will be dropped, as it is directly related to the target variable.
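The checks described above can be sketched as follows. The file path of the real data is not shown here, so a small synthetic frame with the same structure (thirteen float columns plus the 'stabf' object column, with 'stab' as the continuous score behind 'stabf') stands in for it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the real data set (same column structure)
n = 100
df = pd.DataFrame(rng.normal(size=(n, 12)),
                  columns=[f"f{i}" for i in range(12)])
df["stab"] = rng.normal(size=n)
df["stabf"] = np.where(df["stab"] <= 0, "stable", "unstable")

# Completeness and class-balance checks
print(df.shape)                      # rows x columns
print(df.isnull().sum().sum())       # total missing values
print(df.duplicated().sum())         # duplicate rows
print(df["stabf"].value_counts())    # class distribution of the target

# 'stab' is dropped because it determines 'stabf' directly
df = df.drop(columns=["stab"])
```

On the real data, `df.shape` would report (10000, 14) before the drop, and `value_counts()` the 6380/3620 split described above.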

C. PERFORM EXPLORATORY VISUALIZATION
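A minimal visualization sketch, again with synthetic values standing in for one of the real float features; a histogram per feature is a typical way to see the roughly uniform distributions noted later:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
feature = rng.uniform(0.5, 10.0, size=1000)  # stand-in for one float feature

# Histogram to inspect the distribution of a single feature
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(feature, bins=30, edgecolor="black")
ax.set_xlabel("feature value")
ax.set_ylabel("count")
ax.set_title("Distribution of a sample feature")
fig.savefig("feature_hist.png")
```

Repeating this over all thirteen float columns (or using `df.hist()`) gives a quick view of every distribution at once.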

D. MODEL BUILDING AND TRAINING

In building the model, we will split the data into train and test sets at an 80-20 ratio and scale the train set. From our data exploration and visualization, we also observe that we are dealing with binary logistic regression. Since the visualization shows the features are uniformly distributed, we scale them with a standardizing transformation.
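The split, scaling, and fit described above can be sketched as below, with synthetic data standing in for the real frame. Note that the scaler is fitted on the train set only, then applied to both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.uniform(size=(1000, 12))            # stand-in features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # stand-in binary target

# 80-20 train/test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Fit the scaler on the train set only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Binary logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
print(model.score(X_test_scaled, y_test))
```

Fitting the scaler before the split (or on the full data) would leak test-set information into training, which is why the order above matters.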

E. MODEL EVALUATION
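Typical evaluation for a binary classifier covers accuracy, the confusion matrix, and the per-class precision/recall report. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)     # stand-in binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))       # rows: true class, cols: predicted
print(classification_report(y_test, y_pred))  # precision, recall, f1 per class
```

With the slight class imbalance in 'stabf' (6380 vs 3620), the per-class recall in the report is more informative than accuracy alone.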

RANDOM FOREST CLASSIFIER

Random forest is a supervised learning algorithm used for both classification and regression. It is a flexible, easy-to-use algorithm that builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by voting. It also gives a good indicator of feature importance. Random forests have various applications, including recommendation engines, image classification, and feature selection.

RF FOR N=50

RF FOR N=300

RF FOR N=500

RF FOR N=1000
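The four runs above vary only the number of trees (n_estimators). One way to sketch them in a single loop, on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(size=(600, 8))
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)  # stand-in binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Train one forest per tree count and record test accuracy
scores = {}
for n in (50, 300, 500, 1000):
    rf = RandomForestClassifier(n_estimators=n, random_state=1)
    rf.fit(X_train, y_train)
    scores[n] = rf.score(X_test, y_test)

print(scores)
```

Accuracy usually plateaus well before 1000 trees; the extra estimators mainly cost training time, which is why comparing these runs is worthwhile.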

ALGORITHM EVALUATION

EXTRA TREE CLASSIFIER
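An Extra Trees (extremely randomized trees) classifier is fitted the same way as a random forest; the main difference is that split thresholds are drawn at random rather than optimized, which adds randomness and often speeds up training. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.uniform(size=(600, 8))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)   # stand-in binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

et = ExtraTreesClassifier(n_estimators=300, random_state=1)
et.fit(X_train, y_train)
print(et.score(X_test, y_test))

# Feature importances, as with random forests
print(et.feature_importances_.round(3))
```

As with the random forest, `feature_importances_` sums to 1 across features and can be used to rank which of the thirteen float variables drive the 'stabf' prediction.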